unintended feature
Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification
Wu, Xuansheng; Yu, Wenhao; Zhai, Xiaoming; Liu, Ninghao
Modern text classification methods rely heavily on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for training classification models. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant ones, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from the LLM latent space. To ensure that the SAE captures task-specific features, we further fine-tune it on task-specific datasets. When training the classification model, we apply a simple and effective regularizer that minimizes the similarity between the classifier weights and the identified unintended features, thereby removing their impact on classification. We evaluate the proposed framework on three real-world tasks: toxic chat detection, reward modeling, and disease diagnosis. Results show that the framework significantly improves the classifier's generalizability by regularizing features that are not semantically correlated with each task. This work pioneers controllable text classification over LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. We will release our code and data upon acceptance.
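A minimal sketch of the described weight-similarity regularizer, assuming a linear classification head over LLM embeddings and SAE decoder rows as feature directions (the variable names, the unintended-feature index list, and the penalty weight lam are illustrative assumptions, not the authors' released code):

    import torch
    import torch.nn.functional as F

    def regularized_loss(logits, labels, classifier_weight, decoder_dirs,
                         unintended_ids, lam=0.1):
        """Cross-entropy plus a penalty that pushes the classifier weights
        away from the latent directions of identified unintended features.

        classifier_weight: (num_classes, d) linear head over LLM embeddings
        decoder_dirs:      (num_features, d) SAE decoder rows, one per feature
        unintended_ids:    indices of features flagged as unintended
        """
        ce = F.cross_entropy(logits, labels)
        w = F.normalize(classifier_weight, dim=-1)             # (C, d)
        u = F.normalize(decoder_dirs[unintended_ids], dim=-1)  # (K, d)
        penalty = (w @ u.T).pow(2).sum()  # squared cosine similarities
        return ce + lam * penalty

The squared-cosine form keeps the penalty sign-agnostic, so the classifier weights are discouraged from aligning with an unintended direction in either polarity.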
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Law > Criminal Law (0.92)
- Health & Medicine > Consumer Health (0.68)
Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness
Parthasarathy, Ambreesh; Subramanian, Chandrasekar; Senrayan, Ganesh; Adappanavar, Shreyash; Taneja, Aparna; Ravindran, Balaraman; Tambe, Milind
Restless Multi-Armed Bandits (RMABs) have been successfully applied to resource allocation problems in a variety of settings, including public health. With the rapid development of powerful large language models (LLMs), these models are increasingly used to design reward functions that better match human preferences. Recent work has shown that LLMs can tailor automated allocation decisions to community needs using language prompts. However, this has been studied primarily with English prompts and with a focus on task performance alone. This is an issue because grassroots workers, especially in developing countries such as India, prefer to work in local languages, some of which are low-resource. Further, given the nature of the problem, unintended biases across population groups are also undesirable. In this work, we study the effects on both task performance and fairness when the DLM algorithm, a recent method that uses LLMs to design reward functions for RMABs, is prompted with non-English commands. Specifically, we run the model on a synthetic environment with prompts of varying complexity translated into multiple languages. Our results show that the LLM-proposed reward functions are significantly better when prompted in English than in other languages. We also find that the exact phrasing of a prompt affects task performance. Further, as prompt complexity increases, performance worsens for all languages, although it degrades more gracefully with English prompts than with lower-resource languages. On the fairness side, we find that low-resource languages and more complex prompts are both highly likely to create unfairness along unintended dimensions.
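A rough sketch of the evaluation loop such a study implies, assuming a DLM-style reward-proposal call and a synthetic RMAB simulator (propose_reward, plan_whittle, rollout, and the prompt set are hypothetical stand-ins; the actual DLM interface may differ):

    import statistics

    # Hypothetical: the same task command expressed in several languages.
    PROMPTS = {
        "english": "Prioritize beneficiaries who missed their last two calls.",
        "hindi": "...",   # translated variant, elided here
        "tamil": "...",   # translated variant, elided here
    }

    def evaluate_language(lang, prompt, llm, env, trials=10):
        """Ask the LLM for a reward function in the given language, then
        score the resulting RMAB policy in a synthetic environment."""
        scores = []
        for _ in range(trials):
            reward_fn = llm.propose_reward(prompt)  # hypothetical DLM-style call
            policy = env.plan_whittle(reward_fn)    # index policy under that reward
            scores.append(env.rollout(policy))      # task-performance metric
        return lang, statistics.mean(scores), statistics.stdev(scores)

Comparing per-language means (plus a group-level fairness metric from the same rollouts) against the English baseline is what would surface the gaps the abstract reports.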
To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers
Transfer learning, the reuse of learned knowledge, has brought a paradigm shift in the way models are trained. Its benefits of improved accuracy and reduced training time have shown promise for training models with constrained computational resources and fewer training samples. In particular, publicly available text-based models such as GloVe and BERT, trained on large corpora, have seen ubiquitous adoption in practice. In this paper, we ask, "can transfer learning in text prediction models be exploited to perform misclassification attacks?" As our main contribution, we present novel attack techniques that utilize unintended features learnt in the teacher (public) model to generate adversarial examples for student (downstream) models. To the best of our knowledge, ours is the first work to show that transfer learning from state-of-the-art word-based and sentence-based teacher models increases the susceptibility of student models to misclassification attacks. First, we propose a novel word-score based attack algorithm for generating adversarial examples against student models trained using a context-free word-level embedding model. On binary classification tasks trained using the GloVe teacher model, we achieve an average attack accuracy of 97% on IMDB Movie Reviews and 80% on Fake News Detection. For multi-class tasks, we divide the Newsgroup dataset into 6 and 20 classes and achieve average attack accuracies of 75% and 41%, respectively. Next, we present length-based and sentence-based misclassification attacks for the Fake News Detection task trained using a context-aware BERT model, achieving 78% and 39% attack accuracy, respectively. Our results thus motivate the need for training techniques that are robust to unintended feature learning, specifically for transfer-learned models.
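A simplified sketch of a word-score based attack in this spirit, assuming access to a linear student head over averaged GloVe embeddings (a surrogate head trained on the same public embeddings could stand in; the scoring rule and the greedy deletion budget are illustrative assumptions, and the paper's exact algorithm may differ):

    import numpy as np

    def word_scores(tokens, glove, w_true):
        """Score each token by how strongly its public GloVe embedding
        pushes the student's linear head toward the true class."""
        return {t: float(np.dot(glove[t], w_true)) for t in tokens if t in glove}

    def attack(tokens, glove, w_true, budget=5):
        """Greedy misclassification attack: drop the `budget` tokens
        that contribute most to the current (true) prediction."""
        scores = word_scores(tokens, glove, w_true)
        drop = set(sorted(scores, key=scores.get, reverse=True)[:budget])
        return [t for t in tokens if t not in drop]

Dropping or substituting the highest-scoring words lowers the true-class logit, which is the basic mechanism word-score attacks exploit; the frozen, public teacher embeddings are what let an attacker estimate these scores cheaply.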
- Media (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Security & Privacy (1.00)